{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Keras ELMo Tutorial\n",
    "\n",
    "This IPython Notebook contains an example of how the ELMo embeddings from the paper *[Deep contextualized word representations (Peters et al., 2018)](http://arxiv.org/abs/1802.05365)* can be used for document classification.\n",
    "\n",
    "Because computing the embeddings is expensive, we do it once in a preprocessing step:\n",
    "* We read in the dataset (here the IMDB dataset)\n",
    "* Text is tokenized and truncated to a fixed length\n",
    "* Each text is fed as a single sentence to the AllenNLP ElmoEmbedder to get a 1024-dimensional embedding for each word in the document\n",
    "* These embeddings are then fed to the neural network that we train\n",
    "\n",
    "Computing the embeddings once during preprocessing significantly reduces the overall computation time; otherwise they would be recomputed in every epoch. However, this requires enough memory, as the transformed dataset will consist of $\\text{number_of_tokens} \\cdot 1024$ float32 numbers.\n",
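    "\n",
    "As a rough worked example, assuming the settings used in the code of this notebook (1000 documents per split, each padded to 100 tokens) and 4 bytes per float32 value, one split occupies\n",
    "\n",
    "$$1000 \\cdot 100 \\cdot 1024 \\cdot 4\\,\\text{bytes} = 409{,}600{,}000\\,\\text{bytes} \\approx 410\\,\\text{MB}.$$\n",
    "\n",
    "Embedding all 25,000 IMDB training reviews in the same way would need about $25000 \\cdot 100 \\cdot 1024 \\cdot 4\\,\\text{bytes} \\approx 10.2\\,\\text{GB}$, so the number of documents or the token limit may have to be reduced on machines with less memory.\n",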
    "\n",
    "\n",
    "**Note:** Our simple tokenization ignores sentence boundaries, so the complete document is fed to the ElmoEmbedder as one single sentence. As ELMo embeddings are defined sentence-wise, it would be better to first split each document into sentences and to process it sentence by sentence to get the correct embeddings.\n",
    "\n",
    "**Requirements**:\n",
    "* Python 3.6 - lower versions of Python do not work\n",
    "* AllenNLP 0.5.1 - to compute the ELMo representations\n",
    "* Keras 2.2.0 - to build and train the classification network\n",
    "\n",
    "\n",
    "See [https://github.com/UKPLab/elmo-bilstm-cnn-crf](https://github.com/UKPLab/elmo-bilstm-cnn-crf) for how to install these requirements.\n",
    "\n",
    "# Code"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {},
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "Using TensorFlow backend.\n"
     ]
    }
   ],
   "source": [
    "import keras\n",
    "import os\n",
    "import sys\n",
    "from allennlp.commands.elmo import ElmoEmbedder\n",
    "import numpy as np\n",
    "import random\n",
    "from keras.models import Sequential\n",
    "from keras.layers import Dense, Conv1D, GlobalMaxPooling1D, GlobalAveragePooling1D, Activation, Dropout"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Load all files from a directory into a list of dictionaries (one per file, with its text and label)\n",
    "def load_directory_data(directory, label):\n",
    "    data = []\n",
    "    for file_path in os.listdir(directory):\n",
    "        with open(os.path.join(directory, file_path), \"r\", encoding=\"utf-8\") as f:\n",
    "            data.append({\"text\": f.read().replace(\"<br />\", \" \"), \"label\": label})\n",
    "    return data\n",
    "\n",
    "# Load the positive and negative examples from the dataset\n",
    "def load_dataset(directory):\n",
    "    pos_data = load_directory_data(os.path.join(directory, \"pos\"), 1)\n",
    "    neg_data = load_directory_data(os.path.join(directory, \"neg\"), 0)\n",
    "    return pos_data+neg_data\n",
    "\n",
    "# Download (if not already cached by Keras) and load the IMDB dataset\n",
    "def download_and_load_datasets():\n",
    "    dataset = keras.utils.get_file(\n",
    "      fname=\"aclImdb.tar.gz\", \n",
    "      origin=\"http://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz\", \n",
    "      extract=True)\n",
    "\n",
    "    train_data = load_dataset(os.path.join(os.path.dirname(dataset), \"aclImdb\", \"train\"))\n",
    "    test_data = load_dataset(os.path.join(os.path.dirname(dataset), \"aclImdb\", \"test\"))\n",
    "\n",
    "    return train_data, test_data\n",
    "\n",
    "train_data, test_data = download_and_load_datasets()\n",
    "\n",
    "random.shuffle(train_data)\n",
    "random.shuffle(test_data)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Tokenize the text and truncate it to max_tokens. Note: it would be better to first split it into sentences.\n",
    "def tokenize_text(documents, max_tokens):\n",
    "    for document in documents:\n",
    "        document['tokens'] = keras.preprocessing.text.text_to_word_sequence(document['text'], lower=False)\n",
    "        document['tokens'] = document['tokens'][0:max_tokens]\n",
    " \n",
    "max_tokens = 100\n",
    "tokenize_text(train_data, max_tokens)\n",
    "tokenize_text(test_data, max_tokens)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {
    "scrolled": true
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "\n",
      "\n",
      ":: Lookup of 1000 ELMo representations. This takes a while ::\n",
      "100%[==================================================] 1000/1000 sentences\n",
      "\n",
      ":: Lookup of 1000 ELMo representations. This takes a while ::\n",
      "100%[==================================================] 1000/1000 sentences"
     ]
    }
   ],
   "source": [
    "# Look up the ELMo embeddings for all documents (each treated as one sentence) in our dataset.\n",
    "# Keep them in memory so that the ELMo embeddings have to be computed only once.\n",
    "def create_elmo_embeddings(elmo, documents, max_sentences = 1000):\n",
    "    num_sentences = min(max_sentences, len(documents)) if max_sentences > 0 else len(documents)\n",
    "    print(\"\\n\\n:: Lookup of \"+str(num_sentences)+\" ELMo representations. This takes a while ::\")\n",
    "    embeddings = []\n",
    "    labels = []\n",
    "    tokens = [document['tokens'] for document in documents]\n",
    "    \n",
    "    documentIdx = 0\n",
    "    for elmo_embedding in elmo.embed_sentences(tokens):  \n",
    "        document = documents[documentIdx]\n",
    "        # Average the 3 layers returned from ELMo\n",
    "        avg_elmo_embedding = np.average(elmo_embedding, axis=0)\n",
    "             \n",
    "        embeddings.append(avg_elmo_embedding)        \n",
    "        labels.append(document['label'])\n",
    "            \n",
    "        # Some progress info\n",
    "        documentIdx += 1\n",
    "        percent = 100.0 * documentIdx / num_sentences\n",
    "        line = '[{0}{1}]'.format('=' * int(percent / 2), ' ' * (50 - int(percent / 2)))\n",
    "        status = '\\r{0:3.0f}%{1} {2:3d}/{3:3d} sentences'\n",
    "        sys.stdout.write(status.format(percent, line, documentIdx, num_sentences))\n",
    "        \n",
    "        if max_sentences > 0 and documentIdx >= max_sentences:\n",
    "            break\n",
    "            \n",
    "    return embeddings, labels\n",
    "\n",
    "\n",
    "elmo = ElmoEmbedder(cuda_device=1)  # Set cuda_device to the ID of your GPU, or to -1 to run on the CPU\n",
    "train_x, train_y = create_elmo_embeddings(elmo, train_data, 1000)\n",
    "test_x, test_y  = create_elmo_embeddings(elmo, test_data, 1000)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Shape Train X: (1000, 100, 1024)\n",
      "Shape Test X: (1000, 100, 1024)\n"
     ]
    }
   ],
   "source": [
    "# :: Pad the x matrix to uniform length ::\n",
    "def pad_x_matrix(x_matrix):\n",
    "    for sentenceIdx in range(len(x_matrix)):\n",
    "        sent = x_matrix[sentenceIdx]\n",
    "        sentence_vec = np.array(sent, dtype=np.float32)\n",
    "        padding_length = max_tokens - sentence_vec.shape[0]\n",
    "        if padding_length > 0:\n",
    "            x_matrix[sentenceIdx] = np.append(sent, np.zeros((padding_length, sentence_vec.shape[1])), axis=0)\n",
    "\n",
    "    matrix = np.array(x_matrix, dtype=np.float32)\n",
    "    return matrix\n",
    "\n",
    "train_x = pad_x_matrix(train_x)\n",
    "train_y = np.array(train_y)\n",
    "\n",
    "test_x = pad_x_matrix(test_x)\n",
    "test_y = np.array(test_y)\n",
    "\n",
    "print(\"Shape Train X:\", train_x.shape)\n",
    "print(\"Shape Test X:\", test_x.shape)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Train on 1000 samples, validate on 1000 samples\n",
      "Epoch 1/10\n",
      "1000/1000 [==============================] - 12s 12ms/step - loss: 0.7723 - acc: 0.5850 - val_loss: 0.5868 - val_acc: 0.6920\n",
      "Epoch 2/10\n",
      "1000/1000 [==============================] - 1s 841us/step - loss: 0.3291 - acc: 0.8680 - val_loss: 0.4970 - val_acc: 0.7570\n",
      "Epoch 3/10\n",
      "1000/1000 [==============================] - 1s 863us/step - loss: 0.1053 - acc: 0.9870 - val_loss: 0.5948 - val_acc: 0.7350\n",
      "Epoch 4/10\n",
      "1000/1000 [==============================] - 1s 900us/step - loss: 0.0247 - acc: 1.0000 - val_loss: 0.5520 - val_acc: 0.7580\n",
      "Epoch 5/10\n",
      "1000/1000 [==============================] - 1s 939us/step - loss: 0.0076 - acc: 1.0000 - val_loss: 0.5733 - val_acc: 0.7630\n",
      "Epoch 6/10\n",
      "1000/1000 [==============================] - 1s 1ms/step - loss: 0.0034 - acc: 1.0000 - val_loss: 0.6028 - val_acc: 0.7650\n",
      "Epoch 7/10\n",
      "1000/1000 [==============================] - 1s 907us/step - loss: 0.0015 - acc: 1.0000 - val_loss: 0.6538 - val_acc: 0.7640\n",
      "Epoch 8/10\n",
      "1000/1000 [==============================] - 1s 867us/step - loss: 5.3216e-04 - acc: 1.0000 - val_loss: 0.7177 - val_acc: 0.7640\n",
      "Epoch 9/10\n",
      "1000/1000 [==============================] - 1s 875us/step - loss: 2.0735e-04 - acc: 1.0000 - val_loss: 0.7633 - val_acc: 0.7610\n",
      "Epoch 10/10\n",
      "1000/1000 [==============================] - 1s 865us/step - loss: 1.0772e-04 - acc: 1.0000 - val_loss: 0.7963 - val_acc: 0.7620\n"
     ]
    },
    {
     "data": {
      "text/plain": [
       "<keras.callbacks.History at 0x7f2297db7c88>"
      ]
     },
     "execution_count": 6,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# Simple model for sentence / document classification using CNN + global max pooling\n",
    "model = Sequential()\n",
    "model.add(Conv1D(filters=250, kernel_size=3, padding='same'))\n",
    "model.add(GlobalMaxPooling1D())\n",
    "model.add(Dense(250, activation='relu'))\n",
    "model.add(Dense(1, activation='sigmoid'))\n",
    "\n",
    "model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])\n",
    "model.fit(train_x, train_y, validation_data=(test_x, test_y), epochs=10, batch_size=32)\n",
    "\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "_________________________________________________________________\n",
      "Layer (type)                 Output Shape              Param #   \n",
      "=================================================================\n",
      "conv1d_1 (Conv1D)            (None, 100, 250)          768250    \n",
      "_________________________________________________________________\n",
      "global_max_pooling1d_1 (Glob (None, 250)               0         \n",
      "_________________________________________________________________\n",
      "dense_1 (Dense)              (None, 250)               62750     \n",
      "_________________________________________________________________\n",
      "dense_2 (Dense)              (None, 1)                 251       \n",
      "=================================================================\n",
      "Total params: 831,251\n",
      "Trainable params: 831,251\n",
      "Non-trainable params: 0\n",
      "_________________________________________________________________\n"
     ]
    }
   ],
   "source": [
    "model.summary()"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.6.5"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}
